Method and device for determining and outputting the similarity between two data strings

ABSTRACT

The present invention discloses a method and device for determining and outputting a similarity measure between two data strings each data string comprising data entities, comprising: receiving a first data string, receiving a second data string, which is characterized by determining consecutively following data entities in the first data string, determining the relative positions of the consecutively following data entities in the first data string, determining similar data entities with the same order in the second data string, determining the relative positions of the determined data entities in the second data string, determining a matching measure by determining how far the relative positions of data entities in the second data string match with the relative positions of consecutively following data entities in the first data string, and outputting a similarity measure which corresponds to the matching measure of at least one comparison result.

The present invention relates to the field of one-dimensional pattern recognition. It also relates to robust pattern recognition for fault-tolerant string or sequence search applications with a minimized use of calculation power. More specifically, the invention relates to a simple fault-tolerant string comparing algorithm. The presented invention introduces a method and a device that measure the similarity of two sequences or strings in a very fault-tolerant manner. The invention is for special use in the field of mobile terminal devices.

In pattern recognition and signal processing, information is classified according to general categories, all of them having different sets of theories and algorithms to handle them. Two classifications of relevance for this invention are:

a) The dimensionality of the signals to analyze:

-   One-dimensional signals are quite widespread. Examples are audio signals or any time series of one variable.
-   Two-dimensional signals are e.g. images which have two spatial dimensions.
-   Examples of three-dimensional signals are video signals which add the time dimension to an image signal.

b) Discrete or continuous signals:

-   A signal can assume any real or complex value in each of its dimensions or show only discrete (quantized) values. The discretization can happen on two levels:
    -   The signal is only known at discrete spots of one or more dimensions, i.e. the signal is sampled.
    -   The values of the signal are quantized.

The present invention preferably relates to discrete one-dimensional signals which are sampled and quantized.

The most common way to solve the problem of fault-tolerant search has been the application of the so-called “edit distance” in combination with dynamic programming (DP). The edit distance is a measure that describes the distance between two strings S₁ and S₂. The distance is defined as the number of insertions, deletions and variations that are required to transform e.g. a string S₁ into a string S₂. These required transformations are computed by applying the idea of dynamic programming, which is commonly based on a “divide-and-conquer” algorithm:

A divide-and-conquer algorithm subdivides the original problem into smaller independent problem partitions and tries to find a solution for those smaller partitions first. In the following steps the divide-and-conquer algorithm tries to solve the problem for bigger partitions by taking and combining the solutions of the already processed smaller problem partitions. Finally, the algorithm tries to combine these solutions to find a solution for the original problem without any partitioning.
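By way of illustration only, the edit distance with dynamic programming referred to above may be sketched as follows. This is a minimal textbook rendering, not code taken from the cited art; the function name `edit_distance` and the unit cost assigned to each insertion, deletion and variation are assumptions made for the example.

```python
def edit_distance(s1, s2):
    """Count the insertions, deletions and variations turning s1 into s2."""
    # previous[j] is the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string needs i deletions
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,         # delete a
                current[j - 1] + 1,      # insert b
                previous[j - 1] + cost,  # keep or vary the character
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The table of partial solutions grows with the product of both string lengths, which is the runtime dependency the description contrasts with the proposed measure.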

The implementation of an edit distance in combination with dynamic programming (DP) is strictly dependent on the problem context it is applied in. In most cases the implementation cannot be kept untouched if the problem context changes, because the implementation of the edit distance is usually specialized for a particular problem context. Especially the realization of a dynamic programming (DP) algorithm is in most cases highly specialized for the particular problem. Two essential problems may arise in the context of DP:

-   1.) Not every problem can be partitioned so that it can be solved with DP.
-   2.) The number of conceivable problem partitions is too great. As an obvious consequence, the runtime of DP would be inconveniently long.

In the context of a fault-tolerant search, the edit distance in combination with DP represents a specialized solution to measure the similarity of two strings S₁ and S₂. It is obvious that, especially in consideration of 2.), some preconditions need to be defined in order to apply the edit distance in an acceptable runtime if the following faults occurring during the comparison of S₁ and S₂ are to be tolerated:

-   Characters of the string may be varied (in any way)
-   Additional characters may be inserted (anywhere)
-   Characters may be missing (if any, and anywhere)

Many possibilities exist to define preconditions that keep the runtime low, but in most cases either the tolerance or the search accuracy suffers as a result.

Such a method is e.g. disclosed in the European Patent Application EP 1 093 109 A1 “Verfahren zum Erkennen und Auswählen einer Tonfolge, insbesondere eines Musikstücks”. In this document a method is disclosed to compare a first sequence of notes with a second sequence of notes, to determine the similarity between said two sequences. The disclosed method is based on a median and on comparing the tone duration and the variation of adjacent successive tones in both sequences.

It is therefore desirable to improve the present pattern recognition techniques by increasing the robustness of existing string and sequence comparison techniques.

It is further desirable to reduce the search time of an algorithm as compared to present similarity determination algorithms.

This invention presents a general similarity measure that is easily applicable to a great number of independent problem contexts without many changes. The runtime is only dependent on the number of characters the search string S is composed of. In contrast to an edit distance (in combination with DP), the runtime is only affected by the defined tolerance concerning variances; it is not affected by a tolerance in matters of insertions and deletions.

The present invention is based on comparing the order of two strings by uniquely identifying the entities in both strings and subsequently comparing the order of the unique entities. The order is identified preferably by tuples comprising two or more entities. The tuples of the entities define a relationship between the entities. The more identical tuples can be found in both strings, the higher the level of similarity of both strings is.

According to a first aspect of the present invention there is provided a method for determining the similarity measure between two data strings of data entities, values or elements. The method comprises receiving two strings to be compared. The method is characterized by determining consecutively following data entities in said first data string, determining the relative positions of said consecutively following data entities in said first data string, determining similar data entities with the same order in said second data string, determining the relative positions of said determined data entities in said second data string, determining a matching measure by determining how far the relative positions of data entities in said second data string match with the relative positions of consecutively following data entities in said first data string, and outputting a similarity measure which corresponds to the matching measure of at least one comparison result.

The method further comprises determining the relative position of the consecutively following data entities, preferably in all possible tuples, determining the number of similar data entities, particularly tuples, in the second string as a similarity level, and outputting said similarity level. Preferably, the tuples can be pairs, triples, quadruples or n-tuples, wherein n defines the number of entities in the tuple. The present invention is based on relationships between different entities, preferably defined by said tuples, and therefore one-tuples comprising only one entity cannot be used in the present invention.

The reception of the first and second string of entities or values can further comprise a preprocessing of at least one of said strings, e.g. by signal enhancing, quantizing, sampling or by retrieving said string from a storage or any other signal or string source. The entities in said string can represent any set of entities such as characters, notes, digitized signals, music or gene sequences, values, chemical compositions and the like. Said second string of entities is usually retrieved from a pre-stored string library, but can also be retrieved from a second signal source to determine a correlation between two signals.

By determining at least one tuple or a pair of consecutively following data entities in said first data string, a parameter of order in the first string is generated. It may be noted that (the tuples of) consecutive entities are not limited to tuples of directly consecutive entities.

By determining the relative position of said (at least one tuple of) consecutively following data entities in said first data string, the method defines an order of said entities. This order is very clear in the case of 2-tuples, i.e. pairs, where it can be seen whether one entity is positioned before or after the other entity in the string.

Similar to the determination of data entities or tuples in the first string, at least one similar sequence of data entities or tuples of similar consecutively following data entities in said second data string and the relative position of said similar consecutively following data entities in said second data string are determined.

By determining a matching measure by comparing how far the relative positions of the at least one tuple of similar consecutively following data entities in said second data string match with the relative position of said at least one tuple of consecutively following data entities in said first data string, it can be determined how far the order of the first string can be found in the second string. The order corresponds to a similarity of both strings; in the case that the same tuples can be found in both strings, and that all found tuples have the same relation or succession of the entities, the whole strings have to be identical.

Thereafter, the number of (n-tuples of) data entities in said second string which match said determined n-tuples in said first string with respect to their relative order is determined. Said number of data entities (tuples) represents a matching measure. The matching measure describes the similarity of the order of the strings. It is clear that strings shorter than the used n-tuple cannot be compared with the present method. The present invention proposes a method that measures the similarity of two sequences or strings S₁ and S₂ in a very fault-tolerant manner.

Finally, the method outputs a similarity measure which corresponds to the matching measure of at least one comparison result. The similarity measure represents an additional quality in relation to the number of found matching tuples. The matching measure is e.g. dependent on the absolute number of tuples in the strings, and the similarity measure can represent a normalized value, to provide a similarity measure, level or value that is independent of the string length.

The outputting of the similarity level can comprise a transformation into a digital form. The determined similarity level can be output together with the first and/or the second string of entities. This single output of the determined similarity level can e.g. be used in an online monitoring application, wherein two signals are compared.

The invention outlines a new robust pattern recognition scheme and its applications based on an ordering principle. The advantage of the method is that it shows fault-tolerant search capabilities in realistic scenarios, in which the search pattern and the data may exhibit:

-   Substitution errors: Entities at certain positions differ between search pattern and reference pattern.
-   Insertion/deletion errors: The search pattern does not have the same length as the reference pattern since some entities are missing or are additionally present in the reference pattern.

An example embodiment of said n-tuples are 2-tuples, i.e. pairs. The present method comprises the output of said determined similarity level. For a string with n entities a total of n(n−1)/2 pairs are possible and are determined in their order in correspondence with their above-mentioned position in the string. The similarity level describes the similarity of the order of the strings. It is clear that in the case of pairs, a one-element string of a single entity cannot be compared with the present method.
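The pair count stated above can be checked with a short sketch. It assumes that a pair is an ordered combination (earlier entity, later entity) of any two positions in the string, not only adjacent ones; the helper name `ordered_pairs` is hypothetical.

```python
from itertools import combinations


def ordered_pairs(string):
    """Return every (earlier, later) pair of entities, adjacent or not."""
    # combinations() keeps the input order, so the first member of each
    # pair always precedes the second member in the string.
    return list(combinations(string, 2))


pairs = ordered_pairs("abcd")
print(pairs)
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
print(len(pairs) == 4 * (4 - 1) // 2)  # True
```

A string with a single entity yields an empty list, which corresponds to the remark that such a string cannot be compared.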

Yet another example embodiment of the method further comprises the determination of at least one error limit for at least one of said entities. Further, said error limit is considered during said determination of said matching measure. The error limit can be a universal error limit assuming that each value of one of said strings is sampled with e.g. a one-bit sample uncertainty. The error limit can also be an individual pre-stored error limit for only a small number of entities.

This uncertainty can be considered during the generation of said matching tuples, i.e. two entities in said two strings are considered as matching if they differ by less than said error limit. This first approach has the advantage that the number of tuples does not increase, but has the drawback that directly consecutive values lying within said error limit may become indistinguishable. The first approach, operating on the basis of the entities, has the advantage of being suited to large error limits. It is to be noted that the error limit should be considerably smaller than the range of possible values of said entities. The first approach is best used in the case of strings of values with very different entities and large error limits. Tuples of entities are thus considered as matching if the corresponding values differ by less than said universal error limit.
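The first approach may be sketched as follows, assuming numeric entities and a single universal error limit. The function names `entities_match` and `pairs_match` are hypothetical, and the strict "less than" comparison follows the wording above.

```python
def entities_match(a, b, error_limit):
    """Two entities match if they differ by less than the error limit."""
    return abs(a - b) < error_limit


def pairs_match(pair_1, pair_2, error_limit):
    """Two pairs match if their corresponding entities match."""
    return all(entities_match(a, b, error_limit)
               for a, b in zip(pair_1, pair_2))


print(pairs_match((60, 64), (61, 64), error_limit=2))  # True
print(pairs_match((60, 64), (62, 64), error_limit=2))  # False
# Drawback noted above: neighbouring values inside the limit cannot be
# told apart any more.
print(entities_match(60, 61, error_limit=2))  # True
```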

A second approach to consider said error limits is to form a number of tuples comprising all said values for an entity within said error limits. This leads to a large number of tuples: for pairs, for example, an error limit of minus 1 bit leads to a combination of 4 possible pairs, and with a +/−1-bit error limit the number of possible pairs is already 9. This approach is best for small tuples such as pairs, and for small error limits of not more than +/−1 digit, or for individual error limits for each of said entities.
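The growth in the number of tuples under the second approach can be illustrated with a small sketch. It assumes integer-valued entities and represents an error limit as the set of offsets an entity may take; the name `expand_pair` is hypothetical.

```python
from itertools import product


def expand_pair(pair, offsets):
    """List every pair obtainable by shifting each entity by an allowed offset."""
    first, second = pair
    return [(first + d1, second + d2) for d1, d2 in product(offsets, repeat=2)]


one_sided = expand_pair((5, 8), offsets=(-1, 0))       # error limit of minus 1
two_sided = expand_pair((5, 8), offsets=(-1, 0, 1))    # error limit of +/- 1
print(len(one_sided))  # 4
print(len(two_sided))  # 9
```

For an n-tuple and k admissible offsets the expansion has k to the power n members, which is why the description recommends this approach for pairs and small limits only.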

Said error limits can be universal error limits or systematic errors as e.g. caused by preprocessing or signal/string transfer. The error limits can also be statistical error limits determined by error calculation or statistical values. The error limits can be empirically determined error limits, which may be determined separately for each entity or range of entities. In the case of hummed music recognition, empirically determined values may be used to compensate for typical perceptual misinterpretations of tone height, tone lengths or other properties of said entities.

The error limits may be considered during said determination of said matching measure by simply ignoring deviations within the error limits and considering said entities within said error limits as being similar, identical or valid. The error limits can also be considered during the determination of said matching measure as a value describing the actual distance between the values in both strings divided by a maximum error limit. The matching measure can also be generated as a multi-dimensional value, wherein the similarity is generated as a tuple or as a vector describing the order criterion in one value and the “erroneousness” in another value. The resulting matching measure of the calculated similarity measure may be calculated as the arithmetic or geometric mean or a mean square distance of said similarity values. Alternatively, other mapping functions defining upper limits for at least one of said values can be applied.

Another example embodiment of the method further comprises the storing of said second string together with said similarity level and/or said first string of entities. In the case that more than one second string is stored, the second strings can be sorted in accordance with their respective values of similarity, to simplify the access to a most similar string.

Yet another example embodiment of the method further comprises allocating a position label or value to each of the entities in the string, and numbering same entities according to their relative position in accordance with the position label, i.e. the first, second, third ... occurrence of an entity.

The assignment or allocation of a position label to each of said entities in each of said strings is a prerequisite for the following numbering of equal entities. In the case that the entities are already assigned a set of numerals indicating their internal string position, the assignment can comprise the adoption of said position.

The same entities or values in a string are numbered consecutively according to their relative succession in said string according to said positioning labels, i.e. by first, second, third ... occurrence labels. By consecutively numbering identical entities in the string according to the position labels in the string, each entity is uniquely identified. The method thus provides an unambiguous order to the string and to similar entities within said string. Thereby the same entities in the string can be distinguished. Said numbering is carried out in each of said strings respectively.

After the numbering, the relative order for all possible n-tuples of unique entities (i.e. all possible combinations of two, three or four entities) in one of the strings is determined. It may be noted that all possible n-tuples of unique entities, and not only n-tuples of adjacent entities, are determined in their relative order. By determining in said second string the number of n-tuples of numbered entities which match said determined n-tuples of entities in said first string with respect to their relative order, the degree of similarity is determined.

The similarity level is related to the unique entities in the n-tuples and to the succession of the entities in said string. The same relative succession analysis can also be performed in the second string for the determination, followed by a comparing step of the two sets of n-tuples. The total amount of identical n-tuples in both strings defines the similarity level. This level can be standardized by dividing the total amount of equal n-tuples by the maximum number of possible n-tuples of entities in said first or said second string.
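The comparison of the two sets of n-tuples and the subsequent standardization may be sketched for pairs as follows. The sketch assumes that entities are numbered by occurrence, that the level is divided by the number of possible pairs in the first string, and that strings too short to form a pair yield zero; the names `number_entities` and `similarity_level` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def number_entities(string):
    """Make entities unique by attaching their occurrence number."""
    seen = Counter()
    numbered = []
    for entity in string:
        seen[entity] += 1
        numbered.append((entity, seen[entity]))
    return numbered


def similarity_level(first, second):
    """Share of ordered pairs of the first string also found in the second."""
    pairs_first = set(combinations(number_entities(first), 2))
    pairs_second = set(combinations(number_entities(second), 2))
    if not pairs_first:
        return 0.0  # assumption: a string without pairs cannot be compared
    return len(pairs_first & pairs_second) / len(pairs_first)


print(similarity_level("abc", "axbxc"))  # 1.0, insertions are tolerated
print(similarity_level("abc", "cba"))    # 0.0, the order is reversed
```

Building both pair sets explicitly mirrors the wording above; a practical implementation would more likely look up positions in the second string instead of enumerating all of its pairs.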

Another example embodiment of the method further comprises determining the distances between said two entities of said at least one pair of consecutively following data entities in both said data strings, determining a difference between said first and second distances for similar pairs, and considering said difference during said determination of said matching measure.

By determining the distance between said two entities of a pair of entities in said first data string, said pair is further characterized by the number of entities between said two entities. Thereby an additional parameter to describe the relation between two entities can be generated. Two pairs are more similar if the distances between their entities are more similar. If both pairs of entities have the same distance between their entities, it is clear that the pairs are more similar than if a pair of directly consecutive entities is compared with a pair of similar entities spaced apart by 20 or more entities.
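The distance parameter may be sketched as follows, assuming that every entity occurs only once in each string (for example after numbering) and that both entities of the pair are present in both strings. The names `pair_distance` and `distance_difference` are hypothetical.

```python
def pair_distance(string, first, second):
    """Number of entities lying between two unique entities of a string."""
    return string.index(second) - string.index(first) - 1


def distance_difference(string_1, string_2, first, second):
    """How much the spacing of one pair differs between the two strings."""
    return abs(pair_distance(string_1, first, second)
               - pair_distance(string_2, first, second))


print(pair_distance("abc", "a", "b"))                    # 0, directly consecutive
print(distance_difference("abc", "axyzbc", "a", "b"))    # 3
print(distance_difference("abc", "abc", "a", "c"))       # 0
```

A difference of zero marks pairs with identical spacing; larger values can be mapped to a lower contribution to the matching measure.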

As illustrated in the above description of the error limit, this additional position difference parameter can be considered in said determination of said matching measure. As stated above, the matching measure can be generated as a tuple comprising the order characteristic, the error limits and the distance difference. An exemplary mapping function is described in the description of the figures.

Yet another example embodiment of the method further comprises the determination of a threshold for said similarity level and the outputting of said second string if said determined similarity level at least equals said threshold. By using a threshold, the rejection of strings that are too different can be simplified. The threshold can be received from a user or from a storage. The threshold can be selected differently for each of the second strings, to provide an evaluation of the similarity of said second string. This can be a useful feature for the search of a small first string in large second strings.

Yet another example embodiment of the method further comprises: repeating said determination of the similarity level with a number of second strings and determining said threshold in correspondence with a maximum or minimum number of second strings to be output. By using such an adaptive threshold, the number of output values can be defined in accordance with a pre-selectable number of most similar strings.
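The fixed and the adaptive threshold may be sketched as follows. The sketch assumes that the similarity levels of the second strings have already been determined and are supplied as (name, level) tuples; the names `select_by_threshold` and `adaptive_threshold` as well as the sample values are hypothetical.

```python
def select_by_threshold(scored, threshold):
    """Keep the second strings whose level at least equals the threshold."""
    return [name for name, level in scored if level >= threshold]


def adaptive_threshold(scored, max_results):
    """Threshold that admits the max_results most similar strings.

    Assumes a non-empty list and max_results >= 1; ties at the threshold
    may admit more than max_results strings.
    """
    levels = sorted((level for _, level in scored), reverse=True)
    return levels[min(max_results, len(levels)) - 1]


scored = [("song A", 0.9), ("song B", 0.4), ("song C", 0.7)]
threshold = adaptive_threshold(scored, max_results=2)
print(threshold)                               # 0.7
print(select_by_threshold(scored, threshold))  # ['song A', 'song C']
```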

In another example embodiment, the method further analyzes the first string for entities not present and suppresses or deletes all entities in the second string that are not present in said first string. By suppressing entities in the second string, the recognition process can be made faster. The suppression leads to a fragmentation of the second string. Similarly, the entities in the first string that are not present in the second string can also be deleted for the present method to speed up the recognition process. This can be done by determining a first set of entities of the first string, and determining all entities in the second string that are not present in the first string. As both sequences are quantized, the missing entities can easily be determined. This deletion of elements in the first and/or in the second string leads to a consideration of only the subset of entities common to the first and the second string.
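The suppression step may be sketched as follows, assuming hashable entities. The names `suppress_foreign` and `restrict_to_common` are hypothetical.

```python
def suppress_foreign(first, second):
    """Delete from the second string every entity absent from the first."""
    allowed = set(first)
    return [entity for entity in second if entity in allowed]


def restrict_to_common(first, second):
    """Reduce both strings to the entities they have in common."""
    return suppress_foreign(second, first), suppress_foreign(first, second)


print(suppress_foreign([3, 1, 2], [3, 4, 1, 4, 2]))       # [3, 1, 2]
print(restrict_to_common([3, 1, 5, 2], [3, 4, 1, 4, 2]))  # ([3, 1, 2], [3, 1, 2])
```

Only set membership is tested, so the relative order of the surviving entities is left untouched for the subsequent order comparison.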

Yet another example embodiment of the method further determines the number of entities that are present in the second string, but that are not present in the first string, as a second similarity level. This additional value can also be standardized and can serve as an additional value that can e.g. be multiplied with said first similarity value. The number of deleted entities can be divided by the total amount of entities in the first or the second string to provide an evaluation of the number of deleted entities.

Another example embodiment of the method further comprises determining a fraction within said second string comprising at least the same number of common entities that are simultaneously present in both strings. The number can be determined by the total number of entities within the second string or by the respective number of different entities in the first string. By using such a process to determine the length of the second string, it can be assured that errors in the second string caused by additional entities not present in the first string can be filtered out.

Yet another example embodiment of the method further marks all tuples of numbered entities that are identical to said determined tuples of said entities in said first string and have the same succession as in the first or second string, and stores said string with said marked tuples. By marking the entities in the string, the sections within the second string having a high similarity to the first string can be marked to determine the similar strings within said second string.

It should be noted that in the case that the first string is small and the second string is large, the second string can be pre-processed by an auto-similarity algorithm to detect repeated sub-strings within the second string. This feature can be especially useful in the case of music searches from a tone string. The second string can be searched for similar strings such as refrains, to cut down the whole piece of music to half its length or less.

According to yet another aspect of the invention, a software tool is provided comprising program code means for carrying out the method of the preceding description when said program product is run on a computer or a network device.

According to another aspect of the present invention, a computer program product downloadable from a server for carrying out the method of the preceding description is provided, which comprises program code means for performing the preceding methods when said program is run on a computer or a network device.

According to yet another aspect of the invention, a computer program product is provided comprising program code means stored on a computer readable medium for carrying out the methods of the preceding description, when said program product is run on a computer or a network device.

According to another aspect of the present invention a computer data signal is provided. The computer data signal is embodied in a carrier wave and represents a program that makes the computer perform the steps of the method contained in the preceding description, when said computer program is run on a computer or a network device.

According to yet another aspect of the present invention an electronic device for determining the similarity between two data strings of data entities is provided. The electronic device comprises a component for receiving, a processing unit and an interface.

The component for receiving is for receiving a first string of entities and a second string of entities. The component for receiving can be embodied as a single interface or input module or as two different interface modules for receiving said two strings. The interface module can further comprise a quantizer such as an analog-to-digital converter to provide a string even from an arbitrary analog signal. The component for receiving can also comprise an interface to receive said first or second signal from a storage.

The processing unit is connected to said receiving component. The processing unit is configured to determine at least (one tuple of) consecutively following data entities in both of said strings. Said processing unit is further configured to determine the relative position of said at least one tuple of consecutively following data entities in both of said data strings. Said processing unit is configured to determine a matching measure by comparing how far the relative positions of the at least one tuple of similar consecutively following data entities in said second data string match with the relative position of said at least one tuple of consecutively following data entities in said first data string. Said processing unit is further configured to output a similarity measure which corresponds to the matching measure of at least one comparison result. The processing unit can further be configured to execute the processes of the preceding description.

The interface is connected to said processing unit for outputting said similarity level. The interface can further serve for outputting said first and/or second string.

Yet another example embodiment of the electronic device further comprises a storage that is connected to said processing unit for storing received strings and said determined levels of similarity. The storage can also be used to retrieve pre-stored strings.

This implementation is (with some transformations) also applicable in all technical fields wherein strings of quantized values have to be compared with one another, for example in the following exemplary fields of application, among others, which are explained in more detail in the description of the figures:

-   Associative text string search
-   Genome analysis
-   Speech recognition
-   Musical melody search

In the following, the invention will be described in detail by referring to the enclosed drawings in which:

FIG. 1A visualizes the recognition of a pumped sequence, not affecting the recognition result,

FIG. 1B visualizes the recognition of a pumped sequence and the effects on the recognition result,

FIG. 1C is an example for a similarity algorithm and the effects of different values in both sequences,

FIG. 1D is an example for sequence fragmentation according to a search sequence,

FIG. 2 depicts the transformation of an arbitrary sequence S₁ into a unique value notation A_(S₁),

FIG. 3 depicts two sequences that are to be analyzed for their similarity,

FIG. 4 shows a sequence A_(S) defining an order based on the position of the elements in the sequence,

FIG. 5 visualizes an order application,

FIG. 6 is an example for the fragmentation of a sequence,

FIG. 7 is an example for the application of the present invention on strings that are probably faulty,

FIG. 8 depicts an example of an improved similarity measure which considers the different possible distances between the entities in the tuples,

FIG. 9 is an example of a method using the principles of fault tolerance and entity distance according to FIGS. 7 and 8 in one embodiment, and

FIG. 10 shows a hidden Markov model for speech recognition of the word “sound” with state transitions.

FIG. 1A visualizes the recognition of a pumped sequence, not affecting the recognition result, and FIG. 1B visualizes the recognition of a pumped sequence and the effects on the recognition result. The presented invention introduces a method that measures the similarity of two strings or sequences S₁ and S₂ in a very fault-tolerant manner. If the sub-string S₁ is to be recognized in S₂, this method is capable of recognizing similar (or identical) string structures in S₂. The invented method is very robust against insertions and deletions. Additional (and, vice versa, missing) characters are tolerated in such a manner that a similar (or identical) sub-string can be pumped by blank insertions (see FIG. 1A) or by non-blank insertions (see FIG. 1B) and is still recognized as a similar (or identical) sub-string.

FIG. 1C depicts a similarity measure according to the present invention, which is based on the principle of order relationships. The elements of a string S define an order by their fixed positioning within the string S. The basic principle of the invented measure is based on the analysis of whether the relative positioning of two elements of S₁ is also recognizable in the compared string S₂. Two elements a_(i) and a_(j) of S₁ are related to each other if they are positioned “one after another”. The order relation between a_(i) and a_(j) provides information about the elementary order structure. This structure information is transferable to all elements (in S₁) positioned “one after another”, so that the sequence of S₁'s elements defines an order due to the position of the elements. The invented similarity measure therefore describes how many elements of S₂ are also positioned “one after another” as defined in S₁. The more elements of S₂ are relatively positioned the same way as in S₁, the more similar the sequences are as a consequence.

This implies that the two sequences to be compared can have different lengths, which increases the speed of the recognition process. A method to determine the length of the second string section to be compared will be presented in the following in the discussion of FIG. 1C.

An often encountered problem is to find a sequence of (quantized) values, in the following called search string, in a large set of information. Examples of search strings can be strings such as text strings, melodies consisting of musical notes or genome sequences. These strings are transformed into a sequence by adding an index to each symbol which is equal to the number of its repetitions since the beginning of the string. The string S=(1 3 2 1 1 2), for example, is transformed into the sequence A_(S)=(1₁ 3₁ 2₁ 1₂ 1₃ 2₂).
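The transformation of the string S into the sequence A_(S) can be reproduced with a few lines. The sketch represents each indexed symbol as a (symbol, occurrence) tuple, which is an assumption about the data representation; the name `to_unique_notation` is hypothetical.

```python
from collections import Counter


def to_unique_notation(string):
    """Attach to each symbol the number of its occurrences so far."""
    seen = Counter()
    sequence = []
    for symbol in string:
        seen[symbol] += 1
        sequence.append((symbol, seen[symbol]))
    return sequence


print(to_unique_notation([1, 3, 2, 1, 1, 2]))
# [(1, 1), (3, 1), (2, 1), (1, 2), (1, 3), (2, 2)]
```

The printed tuples correspond to 1₁ 3₁ 2₁ 1₂ 1₃ 2₂ in the notation used above.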

FIG. 1C visualizes the problem posed: two sequences A_(S₁) and A_(S₂) are compared, where the search sequence A_(S₁) is shorter than the reference sequence A_(S₂) due to two insertions (4₁ and 4₂) in A_(S₂), depicted in a bold double box. The similarity measure provides the same result irrespective of the insertions, since the order of the existing elements of A_(S₁) is completely preserved, as visualized by the arrows for the mapping.

The order of subsequent symbols in the search string is compared with the order of the same symbols in the database string. The different instances of one symbol are numbered (e.g. 3₁ for the first occurrence of 3 and 3₂ for the second occurrence of 3) in order to distinguish between identical symbols in the strings. If the same order relationship exists in the search string and the database string, the similarity count is increased; otherwise it is left unchanged. In the example of FIG. 1 this means concretely:

-   -   The similarity count is increased by 6 since in both strings 1₁ follows 3₁, 2₁ follows 3₁, 2₂ follows 3₁, 2₃ follows 3₁, 3₂ follows 3₁ and 2₄ follows 3₁.
    -   Then the similarity count is increased by 5 since in both strings 2₁ follows 1₁, 2₂ follows 1₁, 2₃ follows 1₁, 3₂ follows 1₁ and 2₄ follows 1₁.
    -   The similarity count for orders after 1₁ follows the same rules and adds a further 4+3+2+1 to the similarity count.

The final similarity count is 21. It is normalized by the maximum achievable similarity count, which is n(n−1)/2 for a search string of length n. This leads to a similarity range of [0,1]. In our example (n=7) the normalized similarity is 21/21=1, which shows maximum similarity. The same similarity computation for the text strings “example” and “pleasure” yields 4/21, which shows the differences but at the same time demonstrates that certain letters of the two strings have the same order. Therefore the similarity measure is not zero. The similarity is of course zero if completely different character symbol sets are compared. The similarity is equally zero if the sequences have only one entity or character symbol in common and it occurs only once in one of said sequences. In this case no matching pair of character symbols can be found, as all possible pairs contain at least one character symbol not present in the other sequence.

FIG. 1D shows an additional criterion for analyzing two sequences for their similarity: the fragmentation of a sequence (with respect to the characters of a search string). Consider the situation in which two strings S₁ and S₂ are to be analyzed for their similarity. If all characters that are contained in S₂ but not in S₁ are deleted from S₂, the resulting string may be very fragmented, because many characters may have been deleted between the remaining characters. The string S_(remaining) (containing the remaining characters) is equivalent to the intersection of both strings S₁ and S₂, with S_(remaining)=S₁∩S₂. The sum of all position gaps (caused by deleted characters) correlates with the similarity of S₁ and S₂. A position gap is the number of deleted characters between two remaining characters. The lower the sum of position gaps (i.e. deleted characters), the higher the probability of a high similarity of S₁ and S₂. This criterion is also very robust against faults.
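The fragmentation criterion can be illustrated with a short sketch. The function name `fragmentation` is hypothetical, and the choice to count only gaps between two remaining characters (not deletions before the first or after the last remaining character) follows the definition of a position gap given above.

```python
def fragmentation(search, reference):
    """Sum of position gaps in `reference` after deleting every symbol
    that does not occur in `search`.

    A gap is the number of deleted symbols between two remaining symbols.
    Note (assumption of this sketch): if fewer than two symbols remain,
    no gap exists and the result is 0, so the value should only be
    interpreted together with the number of remaining symbols.
    """
    relevant = set(search)
    kept = [pos for pos, symbol in enumerate(reference) if symbol in relevant]
    return sum(b - a - 1 for a, b in zip(kept, kept[1:]))


print(fragmentation("abc", "abc"))     # prints 0
print(fragmentation("abc", "axbyyc"))  # prints 3
```

In the second call, one symbol is deleted between `a` and `b` and two between `b` and `c`, giving a gap sum of 3.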

If the order measure is combined with the fragmentation measure, a powerful and fault-tolerant similarity measure with a very fast runtime is possible, because both criteria can be implemented very efficiently.

FIG. 2 shows a more detailed description of the search algorithm according to the present invention. Two sequences S₁ and S₂ are transformed into the equivalent sequences A_(S₁) and A_(S₂) before they are analyzed for their similarity, because in contrast to S₁ and S₂ the notation of A_(S₁) and A_(S₂) also takes the repetitions of an element into account.

This transformation of S₁ into A_(S₁) maps a group of identical values to a unique notation of each of said values by allocating an index to each of the identical values, so that each value in the string can be recognized individually by its value and its position index.

The transformation is unambiguous but not one-to-one: the index of each of the values can be derived from its position in the string, but the position of a value in the sequence cannot be derived solely from the value and its index.

FIG. 3 depicts two sequences A_(S₁) and A_(S₂) to be analyzed for their similarity. The sequences A_(S₁) and A_(S₂) are composed of elements. The notation of every element is given by the following definition (1) of a sequence A_(S) with unique elements a_(i):

$A_{S} = a_{1}, a_{2}, \ldots, a_{l(S)} \quad \text{with } a_{i} \in \mathbb{N}_{\mathbb{N}} = (\mathbb{N}, \mathbb{N})$

$\text{Let } a_{i} = (f_{i}, r(f_{i})) = (f_{i})_{r(f_{i})} \quad \text{with } f_{i} \in \mathbb{N} \text{ and } r \text{ as repetition of } f_{i}$

The present algorithm of this similarity measure is based on an order definition. The idea is very simple: define a definite order according to the elements' positions within the sequence.

FIG. 5 shows the definition of a given sequence A_(S₁) defining a unique order. Above the line, the position of each value within the sequence is denoted by the natural numbers 1 to 7. Below the line, the occurrences of each single entity are numbered by an index of natural numbers, based on the positions of the entities.

The position p(a_(i)) of every element a_(i) defines the determined position of this element within the sequence:

$\text{Let } p(a_{i}) = i \quad \text{with } a_{i} \in A_{S}$

$\text{It follows } p(a_{1}) < p(a_{2}) < \ldots < p(a_{l(S)})$

By defining a fixed order position for every element, a relationship between two elements is also given by the definition below, wherein two elements are related to each other if the relative order is kept.

-   -   Let R be a relation that is defined as follows:

$R = \{(a_{i}, a_{j}) \mid a_{i}, a_{j} \in A_{S} \text{ with } p(a_{i}) < p(a_{j})\}$

This relationship states that two elements a_(i) and a_(j) of A_(S₁) are related to each other if their relative positioning within A_(S₁) is kept. The order relation between a_(i) and a_(j) carries information about the elementary structure. This structure information is transferable to all elements (in A_(S₁)) positioned “one after another”, so that the sequence of A_(S₁)'s elements defines an order through the elements' positioning. The invented similarity measure therefore analyzes how many elements of A_(S₂) are also positioned “one after another” as defined in A_(S₁). The more elements of A_(S₂) are (relatively) positioned the same way as in A_(S₁), the more similar the sequences are.

Under the definition given above, two elements are related to each other if the position of an arbitrary left element a_(i) is smaller than the position of the right element a_(j) (with p(a_(i))<p(a_(j))).

The similarity of A_(S₁) and A_(S₂) can now easily be measured by applying the order of A_(S₁) to A_(S₂). Applying the order of one sequence (e.g. A_(S₁)) to another sequence (e.g. A_(S₂)) checks how many order relationships defined by A_(S₁) are also kept by the elements in A_(S₂). As already mentioned, the order of A_(S₁) defines a relationship between the elements' positions within the sequence A_(S₁). The similarity measure according to the invention now checks how many times the order (defined by A_(S₁)) can be detected in A_(S₂). A relationship between two elements is kept when the relative positioning of these elements is the same in A_(S₁) and A_(S₂). A position transformation (from A_(S₁) to A_(S₂)) is required to measure an order application.

Equation (1) shows the application of the order (here defined by A_(S₁)) in the context of A_(S₂) (including the position transformation):

$\text{Check if } \left( p(a_{i}) < p(a_{j}) \right) \wedge \left( p(t(a_{i})) < p(t(a_{j})) \right) \quad \text{if } a_{\{i,j\}} \in A_{S_{1}}$

$\text{and } t(a_{m}) = \begin{cases} n & : \exists n \text{ with } a_{m} = a_{n} \text{ and } a_{n} \in A_{S_{2}} \\ \text{not defined} & \text{if } a_{m} \notin A_{S_{2}} \end{cases}$

A position transformation t(a_(m)) simply searches for the position of an element a_(m) (of A_(S₁)) in A_(S₂). Either the element a_(m) is contained in A_(S₂) (at position n) and t(a_(m)) returns the position of a_(m) within A_(S₂), or a_(m) is not contained in A_(S₂) and t(a_(m)) signals that the relationship criterion cannot be kept.
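A position transformation of this kind might be sketched as a dictionary lookup. The names `index_occurrences` and `position_transform` are hypothetical, elements are modeled as `(symbol, occurrence)` tuples, and `None` is used here to stand for "not defined"; all of these are assumptions of the sketch.

```python
from collections import defaultdict


def index_occurrences(symbols):
    """Turn a string into unique elements (symbol, occurrence number)."""
    counts = defaultdict(int)
    indexed = []
    for symbol in symbols:
        counts[symbol] += 1
        indexed.append((symbol, counts[symbol]))
    return indexed


def position_transform(s1, s2):
    """For every element of the indexed s1, return its 1-based position
    in the indexed s2, or None if the element does not occur there."""
    positions = {
        element: pos
        for pos, element in enumerate(index_occurrences(s2), start=1)
    }
    return [positions.get(element) for element in index_occurrences(s1)]


print(position_transform("abc", "axbc"))  # prints [1, 3, 4]
print(position_transform("abd", "axbc"))  # prints [1, 3, None]
```

Because the indexed elements are unique, each lookup has at most one answer, so building the dictionary once keeps every later lookup at constant expected cost.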

FIG. 5 visualizes the position transformation. The similarity measure for two arbitrary sequences A_(S₁) and A_(S₂) (which is based entirely on the ideas explained above) is finally defined and summarized by the definitions and equations below.

Definition of the order relationship:

$R = \{(a_{i}, a_{j}) \mid a_{i}, a_{j} \in A_{S_{1}} \text{ with } p(a_{i}) < p(a_{j})\}$

Check if the order relationship is kept for two elements:

$\omega(a_{i}, a_{j}) = \begin{cases} 1 & : (a_{i}, a_{j}) \in R \\ 0 & : \left( (a_{i}, a_{j}) \notin R \right) \vee \left( a_{i} \notin A_{S_{1}} \right) \vee \left( a_{j} \notin A_{S_{1}} \right) \end{cases}$

Summarize all checks (for all elements):

$s(S_{1}, S_{2}) = \frac{1}{\frac{1}{2}\, l(S_{1}) \left( l(S_{1}) - 1 \right)} \sum\limits_{i=1}^{l(S_{1})-1} \sum\limits_{j=i+1}^{l(S_{1})} \omega\left( t(a_{i}), t(a_{j}) \right) \quad \text{with } a_{\{i,j\}} \in A_{S_{1}}$

$\text{and } t(a_{m}) = \begin{cases} n & : \exists n \text{ with } a_{m} = a_{n} \text{ and } a_{n} \in A_{S_{2}} \\ \text{not defined} & \text{if } a_{m} \notin A_{S_{2}} \end{cases}$
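Read literally, the summation counts every pair of elements of A_(S₁) whose relative order is also found in A_(S₂) and divides by the number of pairs. The following sketch implements that reading; the function names are hypothetical, and returning 0 for search strings with fewer than two elements is an assumption made to avoid a division by zero. The reference sequence in the usage example is invented for illustration and is not taken from FIG. 1C.

```python
from collections import defaultdict
from itertools import combinations


def index_occurrences(symbols):
    """Turn a string into unique elements (symbol, occurrence number)."""
    counts = defaultdict(int)
    indexed = []
    for symbol in symbols:
        counts[symbol] += 1
        indexed.append((symbol, counts[symbol]))
    return indexed


def order_similarity(s1, s2):
    """Fraction of ordered element pairs of s1 whose order is kept in s2."""
    a1 = index_occurrences(s1)
    n = len(a1)
    if n < 2:  # assumption: no pairs means similarity 0
        return 0.0
    pos2 = {element: pos for pos, element in enumerate(index_occurrences(s2))}
    kept = 0
    # combinations() yields pairs (a_i, a_j) with i < j, i.e. the relation R.
    for left, right in combinations(a1, 2):
        p_left, p_right = pos2.get(left), pos2.get(right)
        if p_left is not None and p_right is not None and p_left < p_right:
            kept += 1
    return kept / (n * (n - 1) / 2)


search = [3, 1, 2, 2, 2, 3, 2]
reference = [3, 4, 1, 2, 2, 4, 2, 3, 2]  # same order, two inserted 4s
print(order_similarity(search, reference))  # prints 1.0
print(order_similarity("abc", "cba"))       # prints 0.0
```

The double loop over all pairs costs on the order of n² lookups for a search string of length n, which matches the n(n−1)/2 terms of the sum.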

This invention presents the function s(S₁,S₂) that measures the similarity of two sequences S₁ and S₂ as defined above. This function returns a value with the following properties:

s(S₁,S₂)=1: S₁ is in total accordance with S₂

s(S₁,S₂)=0: Absolutely no accordance between S₁ and S₂

0<s(S₁,S₂)<1: The value of s(S₁,S₂) increases with increasing similarity of S₁ and S₂

FIG. 6 depicts an example of the fragmentation of a sequence to reduce the processing time during pattern recognition. The idea of the invention has been implemented in a fault-tolerant search technique for music sequences in multimedia databases. A hummed (and therefore faulty) melody is transformed into a complexity-reduced sequence of notes. This faulty note sequence is then passed to the fault-tolerant search technique to find the song to which the hummed melody belongs. In this example embodiment of the present invention the recognition algorithm has been extended to improve its efficiency:

The efficiency in terms of runtime and search accuracy can be increased significantly by a simple one-time preprocessing of the music database (in which the search algorithm will later search). So far the notes of a song have been stored in a sorted sequence of notes. This sequence depended only on the notes' positions. If the structure of the stored sequence is changed in such a manner that the sequence is now ordered by the tone heights (each followed by all positions at which the corresponding tone height occurs), the tone heights (and their positions) of a song can be accessed in a very fast (approximately logarithmic) runtime. This kind of selective tone height access makes it possible to focus only on the relevant tone heights of which the search melody is composed. All (non-relevant) tone heights that are not contained in the search melody can easily be ignored for the similarity measure. If the notes of the music database are stored in this manner, the notes need to be re-sequenced for a similarity analysis (after the relevant tone heights have been determined on the basis of the search melody), because the original note sequence was ordered by the notes' positions. The re-sequencing process can easily be realized by using FIFO (first-in-first-out) queues. Every relevant tone height is represented by a FIFO queue that is composed of the corresponding tone height's positions. If the FIFO queues are accessed in such a manner that the FIFO queue with the lowest position (as first element) is always accessed next, the note sequence is automatically re-sequenced in exactly the order of the original sequence. Non-relevant tone heights are ignored.
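The preprocessing and the FIFO-based re-sequencing could look roughly like the following sketch. The names `build_pitch_index` and `resequence`, the use of integers as tone heights, and the 0-based positions are assumptions of this illustration. A plain dictionary is used for the index here, whereas the text speaks of approximately logarithmic access, which would correspond to a sorted structure.

```python
from collections import defaultdict, deque


def build_pitch_index(notes):
    """One-time preprocessing: tone height -> list of its positions."""
    index = defaultdict(list)
    for position, pitch in enumerate(notes):
        index[pitch].append(position)
    return index


def resequence(index, relevant_pitches):
    """Rebuild the note order for the relevant tone heights only.

    Each relevant tone height gets a FIFO queue of its positions; the
    queue whose first element is the lowest position is served next.
    """
    queues = {p: deque(index[p]) for p in relevant_pitches if p in index}
    result = []
    while queues:
        pitch = min(queues, key=lambda p: queues[p][0])
        result.append((queues[pitch].popleft(), pitch))
        if not queues[pitch]:
            del queues[pitch]  # an exhausted queue takes no further part
    return result


song = [60, 62, 64, 60, 65, 62]
index = build_pitch_index(song)
print(resequence(index, {60, 62}))
# prints [(0, 60), (1, 62), (3, 60), (5, 62)]
```

The positions in the output (0, 1, 3, 5) show the fragmentation directly: positions 2 and 4 belonged to tone heights that the search melody does not contain.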

If only relevant tone heights and their positions are re-sequenced in the scope of a similarity measure, the re-sequenced note sequence is fragmented, because non-relevant tone heights and all their positions are completely ignored.

If the distances are now calculated between each two directly successive relevant tone height positions, which are stored in the FIFO queues explained above, these distances represent a measure of the fragmentation of a certain subsequence. This fragmentation also correlates with the probability of a high similarity. If a certain subsequence is only slightly fragmented (with respect to the relevant tone heights of the search melody), this subsequence is also a candidate for a high similarity.

The elimination of non-occurring tone heights by the FIFO structure can be transferred to other discrete-valued sequences to speed up the search process.

If the described fragmentation criterion is combined with the order criterion explained above, the search accuracy can be increased significantly. The explained one-time preprocessing of the music database decreases the search runtime. The runtime can also be decreased significantly by parallelizing the search process. All statements about tone heights (and their positions) regarding fragmentation and preprocessing also apply to tone lengths.

FIG. 7 depicts an improved version of the order criterion for application to strings that are probably faulty. The methods outlined above assume that both strings are faultless. If it can be assumed that one or both of the strings to be compared are more or less faulty, it may be necessary to extend the methods in a more fault-tolerant manner. The single entities of both strings may thus be provided with a tolerance in the measurement of the single entities. It is clear, for example, that in the case of digitized values there is always a one-bit element of uncertainty in the digital values. It can therefore be estimated that every value in the strings comprises a one-bit uncertainty. This uncertainty can also be projected or mapped onto one of the strings, by assuming that the first or the second string comprises a two-bit uncertainty. In case of additional elements of uncertainty in the strings, the number of uncertainties may additionally increase, e.g. because of signal transfer, pre-processing or post-processing actions. The respective uncertainty in each of the strings can be mapped onto the other, e.g. by defining entity-specific uncertainties, or universal uncertainty values for each entity of one of the sequences. The figure depicts a single entity a_(i) of string 2 and a single entity b_(i) of string 1. The entity b_(i) of string 1 is provided with an uncertainty value τ defining a number of tolerated entities b_(i)−τ, b_(i)+1−τ, . . . , b_(i)−1, b_(i), b_(i)+1, . . . , b_(i)−1+τ, b_(i)+τ. It is to be noted that the 1 and the τ are used to symbolize a one-bit difference in the entity value of b_(i). In the case of music this may represent a halftone or something equivalent or (depending on the application) a certain (and possibly b_(i)-dependent) value. It may further be noted that the τ uncertainty values may be weighted differently for the positive and negative side of the value b_(i), resulting in a range of b_(i)−τ to b_(i)+y. The uncertainty value may be calculated or be determined empirically from statistical or systematic errors within the system or the strings.

FIG. 8 depicts an example of the present invention for an improved similarity measure that considers the different possible distances between the entities in the tuples. For clarity it is assumed that only pairs of entities are used. The distance between the entities in a pair can e.g. be defined by the number of entities between them, by the number of steps between the entities, or by the absolute difference of their positions. The additional distance parameter can be used to determine an additional similarity parameter for matching pairs of entities in the two strings. This parameter can be combined e.g. in a 3-tuple comprising the two entities and their respective distance. A possible notation could be (a_(i), 5, a_(j)), wherein a_(i) refers to a first entity, a_(j) refers to the second entity, and 5 refers to the distance or position difference of said entities in their respective data string. In a simple embodiment of the present invention, the number of matching triples can be determined directly by the comparison algorithm above. More sophisticated approaches can consider the difference in the positions as a function of the absolute distance in one of said strings. This can be embodied by an additional normalizing factor such as (2*difference of the distances)/(sum of the distances of both strings). The exact equation for determining the distance difference parameter is not important as long as it is ensured that e.g. division by zero is prevented; it can be selected depending on the actual similarity problem.

The depicted example shows the very simple case wherein a pair of directly consecutive entities in string S₂ is compared with not directly consecutive entities in string S₁. As stated above, this simple embodiment can easily be extended to arbitrary pairs or tuples of entities.
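The normalizing factor mentioned for FIG. 8, (2*difference of the distances)/(sum of the distances of both strings), can be written down directly. The function name `distance_deviation` is hypothetical, and returning 0.0 when both distances are zero is one possible way to prevent the division by zero; the text leaves that choice open.

```python
def distance_deviation(distance_1, distance_2):
    """Relative difference of the two pair distances.

    0.0 means the entities are equally far apart in both strings;
    larger values mean the distances differ more.
    """
    total = distance_1 + distance_2
    if total == 0:  # assumption: guard chosen for this sketch
        return 0.0
    return 2 * abs(distance_1 - distance_2) / total


print(distance_deviation(5, 5))  # prints 0.0
print(distance_deviation(4, 6))  # prints 0.4
```

For non-negative distances the value lies between 0 and 2, so it could be rescaled or subtracted from a constant before being combined with the order-based count.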

FIG. 9 is an example of a method using the principles of fault tolerance and entity distance according to FIGS. 7 and 8 in one embodiment. If two strings S₁ and S₂ are to be analyzed for their similarity using the introduced similarity measure, every element of S₁ and S₂ has to be located at a unique position within S₁ and S₂. In some cases it is unfortunately not possible to comply with this strict requirement, especially for similarity measures on strings that most likely contain faults. This is the case, for example, for music retrieval systems that retrieve a song solely on the basis of a sung, whistled or hummed search melody, which will most likely contain tone height or tone duration errors as well as inserted or omitted tones. As a consequence, the formerly unique element positions within the strings S₁ and S₂ become fuzzy, and fault tolerance has to be taken into account in the similarity measure. A fault-tolerant similarity measure for two strings S₁ and S₂ can thus easily become a combinatorial problem due to the rapidly increasing number of analytical steps. The following section introduces an improved measure that is adapted to the needs of fault tolerance and is still based on the concept of order relationships. It is capable of tolerating:

1.) additional characters,

2.) omitted characters, and

3.) character deviations (i.e. different character values),

if two strings S₁ and S₂ are to be analyzed for their similarity. The following method has been applied successfully to music retrieval.

If two strings S₁ and S₂ are analyzed for their similarity, one string (e.g. S₂) is assumed to be error-free. S₂ is then used to estimate the element positions of the other string S₁: if an element a_(j) is located at position j in S₂, the corresponding element a_(i) of S₁ (with a_(i)=a_(j)) is expected to be located near j in S₁, with i=j±ε. The more similar the strings S₁ and S₂, the more subsequent elements a_(j) and a_(j+1) of S₂ are located similarly in S₁, preserving the order relationship R_(S₂) defined by S₂.

Let p(S₁,a_(j)) be the position of a_(j) (with a_(j) ∈ S₂) in S₁. p(S₁,a_(j)) is thus defined as

$p(S_{1}, a_{j}) = \begin{cases} i & : a_{j} =_{\tau} a_{i},\ a_{i} \in S_{1} \\ \text{not defined} & : \text{otherwise} \end{cases}$

The equivalence (=_(τ)) of two characters a₁ and a₂ defines these elements to be equivalent within the scope of a defined τ-fault-tolerance: a₁ =_(τ) a₂ if ∃ 0≤t≤τ: a₁ ± t = a₂
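For numeric entities, the existence of a t between 0 and τ with a₁ ± t = a₂ amounts to an absolute difference of at most τ. A minimal sketch, assuming numeric entity values such as semitone numbers (the function name `tau_equivalent` and the example values are illustrative only):

```python
def tau_equivalent(a1, a2, tau):
    """True if a1 and a2 differ by at most tau (tau-fault-tolerance)."""
    return abs(a1 - a2) <= tau


# With tau = 1, a note that is one halftone off still counts as a match.
print(tau_equivalent(60, 61, 1))  # prints True
print(tau_equivalent(60, 62, 1))  # prints False
```

With τ = 0 the check reduces to exact equality, so the fault-free comparison is a special case of the tolerant one.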

The distance Δp(S₁,a_(j)) defines the position distance of two subsequent characters a_(j) and a_(j+1) of S₂ in S₁:

$\Delta p(S_{1}, a_{j}) = \begin{cases} p(S_{1}, a_{j+1}) - p(S_{1}, a_{j}) & : p(S_{1}, a_{j+1}) > p(S_{1}, a_{j}) \\ \infty & : p(S_{1}, a_{j+1}) \text{ or } p(S_{1}, a_{j}) \text{ is not defined, or } p(S_{1}, a_{j+1}) \leq p(S_{1}, a_{j}) \end{cases}$

A heuristic similarity measure s(S₁,S₂,R_(S₂)) for two strings S₂ and S₁ can hence be defined as:

$s(S_{1}, S_{2}, R_{S_{2}}) = \frac{\sum\limits_{j=1}^{l(S_{2})-1} \frac{1}{\Delta p(S_{1}, a_{j})}}{\max\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j}) - \min\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j})} \cdot \frac{n_{c}(S_{1}, S_{2})}{l(S_{1})}$

if a_(j) ∈ S₂, and n_(c)(S₁,S₂) is the overall number of kept order relationships (of all subsequent characters of S₂ in S₁). $\max\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j})$ is the largest position of all a_(j) in a string S₂ without any computation speed-up techniques as explained for FIG. 6. p′(S₂,a_(j)) thus reflects the position of a_(j) in the overall string S (with S⊃S₂) of which S₂ is a sub-string. As a consequence, $\max\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j})$ and $\min\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j})$ are defined as:

$\min\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j}) = 1 \quad \text{and} \quad \max\limits_{1 \leq j \leq l(S_{2})} p'(S_{2}, a_{j}) = l(S_{2})$

if no computation speed-up technique as in FIG. 6 is applied.

The introduced heuristic similarity measure is capable of tolerating faults that are caused by wrong, inserted and omitted characters. This measure is capable of clearly recognizing a (faulty) string in a huge set of strings. The higher the value of s(S₁,S₂,R_(S₂)), the more similar the compared strings are. If a string S₂ is to be retrieved from a huge set of strings, the string S₁ with the highest value s(S₁,S₂,R_(S₂)) is the most likely candidate to be the most similar one to S₂. The concept of s(S₁,S₂,R_(S₂)) is still based on the evaluation of kept order relationships.
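How the pieces of the heuristic measure fit together can be sketched under several assumptions. The positions p(S₁,a_(j)) are taken as already determined and passed in as a list (with `None` for "not defined"), because the text does not fix how a position is chosen when several tolerant matches exist. The sum runs over the consecutive characters of S₂, following the definition of Δp. The names `delta_p` and `heuristic_similarity`, the parameter `span` for the difference between the largest and smallest p′, and the zero result for degenerate inputs are all choices of this sketch.

```python
import math


def delta_p(p_current, p_next):
    """Position distance of two subsequent characters of S2 in S1.

    Infinite if either position is undefined or the order is not kept.
    """
    if p_current is None or p_next is None or p_next <= p_current:
        return math.inf
    return p_next - p_current


def heuristic_similarity(positions, len_s1, span):
    """positions: p(S1, a_j) for consecutive a_j of S2 (None = undefined).
    len_s1: l(S1). span: max p'(S2, a_j) - min p'(S2, a_j)."""
    if span <= 0 or len_s1 <= 0:  # assumption: degenerate input scores 0
        return 0.0
    deltas = [delta_p(a, b) for a, b in zip(positions, positions[1:])]
    closeness = sum(1 / d for d in deltas)             # 1 / inf is 0.0
    kept = sum(1 for d in deltas if d != math.inf)     # n_c
    return closeness / span * kept / len_s1


tight = heuristic_similarity([1, 2, 3, 4], len_s1=4, span=3)
loose = heuristic_similarity([1, 3, None, 4], len_s1=4, span=3)
print(tight > loose)  # prints True
```

A candidate in which the characters of S₂ appear in order and close together therefore scores higher than one with gaps or missing characters, which is the ranking behaviour the measure is meant to provide.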

The fuzzy similarity algorithm can be employed in a variety of technical systems where a sequence of symbols or quantized values is compared with a second sequence of symbols to find the locations of highest similarity. Four typical applications are sketched in the following:

Associative Text String Search

The algorithm can be used to implement an associative text search, since not only exact matches can be found but also imperfect ones. Those imperfect matches have a similarity below 1. An end user could define the fault tolerance level by adjusting a similarity threshold which has to be exceeded to report a match: the user can start with a perfect match (allowing insertions of characters in the reference string, but no deletions) and, if no result is found, lower the threshold to get more results. Alternatively, the matches could be displayed sorted by decreasing similarity, showing the most similar ones first.
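The threshold-based reporting described here might be sketched as follows. The function names, the candidate labels and the threshold values are invented for illustration, and the similarity scores are assumed to have been computed beforehand. A match is reported when its similarity at least equals the threshold, so that a threshold of 1 admits perfect matches.

```python
def rank_matches(scored, threshold):
    """Keep (name, similarity) pairs at or above the threshold,
    most similar first."""
    hits = [(name, s) for name, s in scored if s >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


def search_with_relaxation(scored, thresholds=(1.0, 0.8, 0.6)):
    """Start with a perfect match and lower the threshold step by step
    until at least one result is found."""
    for threshold in thresholds:
        hits = rank_matches(scored, threshold)
        if hits:
            return threshold, hits
    return None, []


scored = [("alpha", 0.5), ("beta", 0.9), ("gamma", 0.7)]
print(search_with_relaxation(scored))
# prints (0.8, [('beta', 0.9)])
```

Calling `rank_matches(scored, 0.0)` instead gives the alternative presentation, in which all candidates are listed by decreasing similarity.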

Musical Melody Search

A polyphonic piece of music consists of several different voices, each of which is characterized as a sequence of notes. Those notes can be regarded as quantized values with tone height and length. The robust similarity measure is applied to the problem of detecting a hummed or sung melody (i.e. a sequence of notes) in a music database of several pieces of music. Once a piece of music is found in which the search melody occurs with high similarity, it is presented to the user for playback purposes. This application is of high relevance for entertainment terminals where pieces of music do not have to be identified by entering text (i.e. title, composer or performer), but by singing a prominent part (e.g. the refrain) or an arbitrary part of the music.

Genome Analysis

Genomic signal processing is often concerned with finding DNA (deoxyribonucleic acid) sequences on longer strands of DNA consisting of the adenine, thymine, cytosine and guanine (A, T, C and G) nucleotides. Thus a DNA strand can be described as a sequence of the symbols A, T, C and G. Identical, but also similar DNA sequences which differ at certain positions can be found by the above-mentioned robust search algorithm. This similarity comparison is even more pronounced if the sequences are long, as is typically the case in DNA analysis with thousands of nucleotides. The application of the present invention to genome sequence identification is a good example of a search algorithm wherein the fragmentation has nearly no effect, as the low number of entities (4) makes it improbable that a sequence occurs that does not have a nearly even distribution of nucleotides. Genome analysis, on the other hand, reveals a further aspect of the entities: instead of the 4 nucleotides themselves, the 20 nucleotide-coded amino acids can be selected to describe and compare the sequences, wherein one entity is represented by one or more 3-tuples of nucleotides. It may be noted that in the case of genome analysis on the basis of the DNA, speed-up techniques like fragmentation and entity recognition cannot be applied, as it is very likely that e.g. in a 200-entity sequence the single nucleotides are more or less evenly distributed. So it can be assumed that fragmentation techniques cannot help to speed up the recognition process. In the case of nucleotide-based string analysis, it would also not be applicable to tolerate a faulty value for the single nucleotides, because it is not clear which one is the “nearest” nucleotide for e.g. thymine. In the case of an RNA triplet analysis, each nucleotide triplet codes an amino acid. The triplets can provide a sequence, as 4*4*4=64 triplets code only 20 different amino acids or commands. In this case, it is clear that e.g. UUA, UUG, CUU, CUC, CUA and CUG code the same amino acid, leucine, and can therefore be regarded as equal. To apply the present invention to these triplets, the pairs of amino acids or triplets are chosen as follows: (GAU, UAC), defining the order of aspartic acid followed by tyrosine, or equivalently the pair (asp, tyr), to be compared in two RNA strings.

Speech Recognition (see FIG. 10)

FIG. 10 depicts an exemplary speech recognition application of the present search algorithm, using the example of the hidden Markov model (HMM) for speech recognition of the word “sound” with state transitions. In speech recognition the phonetic sounds and their transitions in words are commonly modeled by hidden Markov models (HMM): during a short time window the spectral characteristics of the signal are analyzed and mapped to the most similar phoneme. The spectral changes while articulating a word lead to transitions between the corresponding HMM phoneme states. Normally, the word recognized is the one that shows the most similar state transition sequence compared to the acoustical input. The computation of the most similar sequence of phonemes is often done by dynamic programming.

This is an ideal case for applying the fault-tolerant similarity algorithm, since it copes well with the time variations while articulating a word and substitutes for the dynamic programming approach: some of the sounds are encountered for longer or shorter than in the reference recordings (loop-back arrows), the normal sequence s->ou->n->d can be encountered, or some sounds can even be skipped (arrows which bypass states). The number of states is on the order of several dozen per language, and one state is considered to be stable for at least 10 ms. The discrete nature of the phoneme states and the variability of sounds fit the robust order recognition search paradigm of the present invention very well.

This application contains the description of implementations and embodiments of the present invention with the help of examples. It will be appreciated by a person skilled in the art that the present invention is not restricted to details of the embodiments presented above and that the invention can also be implemented in another form without deviating from the characteristics of the invention. The embodiments presented above should be considered illustrative, but not restricting. Thus the possibilities of implementing and using the invention are only restricted by the enclosed claims. Consequently, various options of implementing the invention as determined by the claims, including equivalent implementations, also belong to the scope of the invention.

1. Method for determining and outputting a similarity measure between two data strings each data string comprising data entities, comprising: receiving a first data string, receiving a second data string, characterized by determining consecutively following data entities in said first data string, determining the relative positions of said consecutively following data entities in said first data string, determining similar data entities with the same order in said second data string, determining the relative positions of said determined data entities in said second data string, determining a matching measure by determining how far the relative positions of data entities in said second data string match with the relative positions of consecutively following data entities in said first data string, and outputting a similarity measure which corresponds to the matching measure of at least one comparison result.
 2. Method according to claim 1, wherein pairs ofconsecutively following data entities are determined in said first datastring.
 3. Method according to claim 1, further comprising: determiningat least one error limit for at least one of said entities, consideringsaid at least one error limit during said determination of said matchingmeasure.
 4. Method according to claim 2, further comprising, allocatinga position label to each of said entities in the string, and numberingsame entities according to their relative position in accordance withthe position label.
 5. Method according to claim 2, further comprising:determining a first distance between said two data entities ofconsecutively following data entities in said first data string,determining a second distance of said two data entities determined insaid second data string, determine a difference between said first andsecond distances, and considering said difference during saiddetermination of said matching measure.
 6. Method according to claim 1,further comprising: storing said second string together with saidsimilarity measure.
 7. Method according to claim 1, further comprising:determining a threshold for said similarity measure, and outputting saidsecond string, if said determined similarity measure at least equalssaid threshold.
 8. Method according to claim 7, further comprising:repeating said determination of said similarity measure with a number ofsecond strings, and determining said threshold in correspondence with anumber of second strings to be outputted.
 9. Method according to claim1, further comprising: analyzing the first string for entities notpresent in the first string, and suppressing in the second string allsaid entities not present in said first string.
 10. Method according to claim 9, further comprising: determining the number of entities that are present in the second string, but are not present in the first string, as a second similarity measure.
 11. Method according to claim 10, further comprising: determining a section within said second string comprising at least the same number of entities that are simultaneously present in both strings.
 12. Software tool comprising program code means stored on a computer readable medium for carrying out the method of any one of claims 1 to 11 when said software tool is run on a computer or network device.
 13. Computer program product comprising program code means stored on a computer readable medium for carrying out the method of any one of claims 1 to 11 when said program product is run on a computer or network device.
 14. Computer program product comprising program code, downloadable from a server, for carrying out the method of any one of claims 1 to 11 when said program product is run on a computer or network device.
 15. Computer data signal embodied in a carrier wave and representing a program that instructs a computer to perform the steps of the method of any one of claims 1 to 11.
 16. Electronic device for determining and outputting a similarity measure between two data strings each comprising data entities, comprising: a component for receiving a first string of entities and a second string of entities, a processing unit being connected to said receiving component, said processing unit being configured to determine at least one tuple of consecutively following data entities in said first data string, said processing unit being configured to determine the relative position of said at least one tuple of consecutively following data entities in said first data string, said processing unit being configured to determine at least one tuple of similar consecutively following data entities in said second data string, said processing unit being configured to determine the relative position of said at least one tuple of similar consecutively following data entities in said second data string, said processing unit being configured to determine a matching measure by comparing how far the relative position of the at least one tuple of similar consecutively following data entities in said first data string matches with the relative position of said at least one tuple of similar consecutively following data entities in said second data string, and said processing unit being configured to output a similarity measure which corresponds to the matching measure of at least one comparison result, and an interface being connected to said processing unit for outputting said similarity measure.
 17. Electronic device according to claim 16, further comprising a storage connected to said processing unit for storing received strings and said determined similarity measures.
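
For illustration only, the steps recited in claims 1, 2, 3, 5, 7 and 9 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `pair_similarity`, `suppress_foreign` and `filter_candidates`, the default `error_limit` of 1, the scoring rule `1 / (1 + difference)`, and the normalization by the number of pairs are all hypothetical choices made for the example and do not appear in the claims.

```python
def pair_similarity(first, second, error_limit=1):
    """Illustrative similarity in [0, 1] between two sequences of entities.

    Assumption: each pair of consecutive entities in `first` contributes
    1 / (1 + difference), where `difference` compares the pair's distance
    in `first` (always 1) with its closest in-order distance in `second`.
    """
    if len(first) < 2:
        return 0.0

    # Relative positions of every entity in the second string.
    positions = {}
    for index, entity in enumerate(second):
        positions.setdefault(entity, []).append(index)

    score = 0.0
    pair_count = len(first) - 1
    for i in range(pair_count):
        a, b = first[i], first[i + 1]
        first_distance = 1  # consecutive entities in the first string
        best = None
        for pos_a in positions.get(a, []):
            for pos_b in positions.get(b, []):
                if pos_b <= pos_a:
                    continue  # entities must keep the same order
                second_distance = pos_b - pos_a
                difference = abs(second_distance - first_distance)
                if best is None or difference < best:
                    best = difference
        # Pairs whose difference exceeds the error limit do not count.
        if best is not None and best <= error_limit:
            score += 1.0 / (1 + best)
    return score / pair_count


def suppress_foreign(first, second):
    """Drop from `second` every entity that does not occur in `first`."""
    allowed = set(first)
    return [entity for entity in second if entity in allowed]


def filter_candidates(first, candidates, threshold=0.5, error_limit=1):
    """Return (similarity, candidate) for candidates at or above threshold."""
    scored = [(pair_similarity(first, c, error_limit), c) for c in candidates]
    return [(s, c) for s, c in scored if s >= threshold]


if __name__ == "__main__":
    for s, c in filter_candidates("abcd", ["abcd", "dcba", "abxcd"], 0.8):
        print(c, round(s, 3))
```

With these assumptions an identical string scores 1.0, a reversed string scores 0.0 because no pair keeps its order, and a string with one inserted foreign entity is penalized only for the pair it splits. Applying `suppress_foreign` before the comparison removes that penalty altogether.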